Responsible research
and reproducibility

2025 DSS Bootcamp

Colin Rundel

Some case studies

Bad spreadsheet merge kills depression paper, quick fix resurrects it

  • The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error in the identification codes. The reported analyses were based on the data set containing this error; further analyses established that the results reported in the manuscript, and the interpretation of the data, are not correct.

  • Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].

  • Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].

Study of social media retracted when authors can’t provide data

A business journal has retracted a 2016 paper about how social media can encourage young consumers to become devoted to particular brands, after discovering flaws in the data and findings.

  • Reasons for retraction:

    • Error in data
    • Error in results and/or conclusions
    • Results not reproducible

Heart pulls sodium meta-analysis over duplicated, and now missing, data

The journal Heart has retracted a 2012 meta-analysis after learning that two of the six studies included in the review contained duplicated data. Those studies, it so happens, were conducted by one of the co-authors.

The Committee considered that without sight of the raw data on which the two papers containing the duplicate data were based, their reliability could not be substantiated. Following inquiries, it turns out that the raw data are no longer available, having been lost as a result of a computer failure.

Clusterfake

Shu, Mazar, Gino, Ariely, & Bazerman (2012), “Signing at the beginning makes ethics salient….”, published in PNAS, was a highly influential paper claiming that signing a declaration of honesty at the beginning of a survey, rather than at the end, reduced dishonest self-reports. It comprised multiple independent studies and was widely cited in the literature.

The paper was retracted in 2021 due to concerns about data fabrication and manipulation in one of the studies.

Subsequently, additional issues were discovered in a different study by a different author within the same paper.

Practice

Reproducibility in practice

  • Are the tables and figures reproducible from the code and data?

  • Does the code actually do what you think it does?

  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)

  • Can the code be used for other data, especially future updates to the current data?

  • Can you extend the code to do other things?
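One way to make these questions concrete is to write the analysis as a parameterized function rather than a one-off script, so the same code can be rerun unchanged on future updates to the data. A minimal sketch in Python (the file and column names are hypothetical):

```python
import csv
import statistics

def summarize(path, value_col):
    """Compute summary statistics for one column of a CSV file.

    Parameterizing the path and the column name means the same code
    can be reused for other data sets or future data updates.
    """
    with open(path, newline="") as f:
        values = [float(row[value_col]) for row in csv.DictReader(f)]
    return {"n": len(values), "mean": statistics.mean(values)}
```

Because nothing is hard-coded, rerunning the analysis on next year's file is a one-argument change rather than an edit of the script itself.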

Reproducibility in science

Ambitious goal

We need an environment where:

  • data, analysis, and results are tightly connected, or better yet, inseparable,

  • reproducibility is built in,

    • the original data remains untouched
    • all data manipulations and analyses are inherently documented
  • all procedures are human readable and understandable.
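These properties can be sketched as a small scripted pipeline in Python (the paths and the cleaning rule are hypothetical): the raw file is only ever read, every manipulation is spelled out in code and is therefore self-documenting, and derived data are written to a separate file.

```python
import csv
import pathlib

def clean(rows):
    """Drop records with any missing value; the manipulation is
    explicit here, not hidden in a spreadsheet edit."""
    return [r for r in rows if all(v != "" for v in r.values())]

def run(raw_path, derived_path):
    """Read the raw file (read-only) and write derived data elsewhere."""
    raw = pathlib.Path(raw_path)
    derived = pathlib.Path(derived_path)
    with raw.open(newline="") as f:          # the raw data remain untouched
        rows = list(csv.DictReader(f))
    cleaned = clean(rows)
    derived.parent.mkdir(parents=True, exist_ok=True)
    with derived.open("w", newline="") as f:  # outputs regenerated on each run
        writer = csv.DictWriter(f, fieldnames=cleaned[0].keys())
        writer.writeheader()
        writer.writerows(cleaned)
```

Deleting the derived file loses nothing: rerunning the script reproduces it exactly from the original data.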

Donald Knuth, “Literate Programming” (1984)

“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

“The practitioner of literate programming […] strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.”

  • These ideas have been around for years!

  • Tools for putting them to practice have also been around.

  • They have never been as accessible as they are today.

Reproducible data analysis stack


  • Scriptability: R / Python
  • Literate Programming: RMarkdown / Jupyter / Quarto
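As a minimal illustration of the literate programming layer, here is a hypothetical Quarto document (the filename and column are made up): the narrative and the code that produced the numbers live in one file, and rendering re-executes the analysis every time, so prose and results cannot drift apart.

````markdown
---
title: "Analysis report"
format: html
---

The summary below is regenerated from the data on every render.

```{r}
d <- read.csv("data/merged.csv")
summary(d$value)
```
````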

  • Version Control: Git / GitHub
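Version control closes the loop: every change to the data and the code is recorded, so any past state of the analysis can be recovered. A minimal shell sketch, using a throwaway directory and hypothetical filenames:

```shell
set -e
dir=$(mktemp -d)                     # throwaway directory for the sketch
cd "$dir"
git init -q
git config user.name  "Analyst"      # local identity for the demo repo
git config user.email "analyst@example.com"

echo "id,score" > data_raw.csv                            # hypothetical raw data
echo 'summary(read.csv("data_raw.csv"))' > analysis.R     # hypothetical script

git add data_raw.csv analysis.R
git commit -q -m "Add raw data and analysis script"
git log --oneline                    # the full history of data and code
```

From here, `git diff` and `git checkout` make it possible to see exactly what changed between any two versions of the analysis, and to rerun an older one.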